To the editor,


Coronaviruses are enveloped, positive-sense, single-stranded RNA viruses. They are
classified into four major genera, Alphacoronavirus, Betacoronavirus,
Gammacoronavirus and Deltacoronavirus, based on serological and genetic studies
(Li, 2016). Alphacoronaviruses and betacoronaviruses mainly infect mammals,
whereas gammacoronaviruses and deltacoronaviruses mainly infect birds (Tang et
al., 2015). Coronaviruses pose a serious threat to human health and global security
because several of them can cross species barriers to infect humans, such as the
Severe Acute Respiratory Syndrome Coronavirus (SARS-CoV) and Middle East
Respiratory Syndrome Coronavirus (MERS-CoV) (Lu et al., 2015; Peck et al., 2015;
Smith, 2006). SARS-CoV was reported to cause 774 human deaths in 37
countries from 2002 to 2003 (Smith, 2006), while MERS-CoV continues to
infect humans in many countries and has already caused more than 700 deaths
worldwide (World Health Organization, 2017). Preventing and controlling
coronavirus infection has become a global concern.

The coronavirus genome generally encodes more than ten proteins (Peck et al.,
2015; Yang et al., 2013). Among them, the spike surface envelope glycoprotein is
responsible for binding to host receptors and largely determines the tissue tropism
and host range of the virus (Li, 2015, 2016; Lu et al., 2015). The spike protein
contains an ectodomain, a transmembrane anchor and a short intracellular tail. The
ectodomain can be cleaved into a receptor-binding S1 subunit and a
membrane-fusion S2 subunit during molecular maturation. The S1 subunit binds to a
host receptor for entry into the host cell (Li, 2015, 2016; Qian et al., 2015).
Depending on the coronavirus species, the spike protein can bind to either protein
receptors or glycans (Li, 2016). Multiple receptors have been reported for
coronaviruses. This is largely attributed to the two receptor-binding domains (RBDs)
on the S1 subunit: one is located in the N-terminal region (denoted as NTD), while
the other is located in the C-terminal region (denoted as CTD) (Li, 2016). A given
coronavirus species generally uses one RBD. Some coronaviruses use the NTD, for
example the mouse hepatitis virus (MHV) (Peng et al., 2011), while others use the
CTD, such as SARS-CoV (Lu et al., 2015) and MERS-CoV (Lu et al., 2015). Previous
studies suggest that having two RBDs could facilitate expansion of the host range of
the virus (Li, 2015, 2016). However, the mechanism underlying RBD usage remains
obscure, and the RBD usage of most coronavirus species is still unknown. Here, we
attempted to develop a computational method for determining the RBD usage of a
coronavirus from the protein sequence of S1.

We first manually compiled twelve coronavirus species with RBD usage reported
in the literature (Table S1). Four coronavirus species use the NTD: the
bovine coronavirus (BCoV), MHV, infectious bronchitis virus (IBV) and the human
coronavirus OC43 (HCoV-OC43). The other eight species use the CTD: the
human coronavirus 229E (HCoV-229E), feline coronavirus (FCoV), bat coronavirus
HKU4 (BatCoV-HKU4), human coronavirus HKU1 (HCoV-HKU1), human
coronavirus NL63 (HCoV-NL63), MERS-CoV, SARS-CoV and transmissible
gastroenteritis virus (TGEV). The protein sequences of the spike protein S1 subunit of
these viruses were collected from the NCBI protein database. For convenience, only
the N-terminal 800 amino acids of each spike protein sequence, which cover the
S1 subunit of all coronavirus species, were kept for further analysis (Supplementary
Methods).

Then, the frequency of each k-mer (one or two amino acids) was used individually to
predict whether a coronavirus uses the NTD or the CTD for binding to the receptor (see
Supplementary Methods and Table S2). Most k-mers achieved a predictive accuracy
ranging from 0.6 to 0.8. Surprisingly, we found that one pair of amino acids, “FS”, could
discriminate the RBD usage of these 12 coronavirus species with an average
predictive accuracy of 0.97 (Fig. 1A). More specifically, it achieved an accuracy of
1.00 for BCoV, MHV, HCoV-OC43, BatCoV-HKU4, HCoV-HKU1, HCoV-NL63
and TGEV, and an accuracy of 0.94, 0.87, 0.99, 0.99 and 0.92 for IBV, HCoV-229E,
FCoV, MERS-CoV and SARS-CoV, respectively. Analyzing the number of “FS” in
the protein sequence of the S1 subunit of these viruses, we found that the viruses using
the NTD generally had fewer than 3 “FS”s in S1, except for IBV, while the viruses using
the CTD generally had 6 or more “FS”s in S1 (Fig. 1A).
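
The k-mer frequency and “FS” counting steps described above can be sketched in
Python as follows. This is only an illustration under stated assumptions: the function
names, the default truncation to 800 residues, and the decision rule (fewer than 3 “FS”
suggests NTD, 6 or more suggests CTD, anything in between is left undetermined) are
hypothetical and are read off the counts reported in the text, not taken from the
Supplementary Methods. A rule of this kind would misassign exceptions such as IBV.

```python
from collections import Counter


def kmer_frequencies(seq, k):
    """Relative frequency of each overlapping k-mer in a protein sequence."""
    seq = seq.upper()
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def count_fs(seq, max_len=800):
    """Count the dipeptide "FS" in the N-terminal max_len residues.

    max_len=800 mirrors the truncation described in the text.
    "FS" cannot overlap with itself, so str.count is sufficient.
    """
    return seq.upper()[:max_len].count("FS")


def guess_rbd_usage(seq, low=3, high=6):
    """Hypothetical rule of thumb: the thresholds are assumptions
    based on the counts reported for the 12 reference species."""
    n = count_fs(seq)
    if n < low:
        return "NTD"
    if n >= high:
        return "CTD"
    return "undetermined"


# Toy sequences for illustration only; these are not real spike proteins.
print(guess_rbd_usage("MKKFSLLAVG"))
print(guess_rbd_usage("MFSAFSKKFSLLFSQQFSPPFS"))
```

Leaving intermediate counts undetermined avoids inventing a cutoff that the text does
not state.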

Further analysis of the ratio between the observed and expected number of “FS” in the S1
protein of these viruses showed that “FS” was under-represented in the viruses
using the NTD (Fig. S1), i.e., the observed number of “FS” in S1 was lower than
expected, whereas in the viruses using the CTD, “FS” was generally
over-represented in S1. We next analyzed the location of “FS”s on the 3D structure of
the S1 protein. Figures 1B and 1C show the 3D structures of the S1 protein of
MHV and HCoV-NL63, respectively. For most coronavirus species, the “FS”s (colored
in blue) were generally scattered around the S1 protein (Fig. 1B&C and Fig. S2). Few
of them were located in or near the receptor-binding interface (colored in red),
suggesting that “FS” may not contribute directly to the virus-receptor interaction. One
exception is SARS-CoV, for which there was one “FS” in the interface (Fig. S2A).
Further work is needed to clarify how “FS” influences the RBD usage of the
coronavirus.
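
The observed-to-expected ratio used in this analysis can be illustrated with a short
Python sketch. The exact definition of the expected count is given in the Supplementary
Methods and is not reproduced here, so the formula below is an assumption: it takes the
expected count of a dipeptide to be the product of the two single-residue frequencies
times the number of dipeptide positions, E = f(F) x f(S) x (L - 1), which treats
residues as independent. The function name is likewise hypothetical.

```python
def dipeptide_obs_exp_ratio(seq, dipeptide="FS"):
    """Observed / expected count of a dipeptide in a protein sequence.

    Assumption: the expected count is f(a) * f(b) * (L - 1), where f is
    the single-residue frequency and L the sequence length. This is one
    common definition, not necessarily the one used by the authors.
    """
    seq = seq.upper()
    n = len(seq)
    if n < 2 or len(dipeptide) != 2:
        raise ValueError("need a sequence of length >= 2 and a 2-letter k-mer")
    first, second = dipeptide
    observed = sum(1 for i in range(n - 1) if seq[i:i + 2] == dipeptide)
    expected = (seq.count(first) / n) * (seq.count(second) / n) * (n - 1)
    if expected == 0:
        return float("nan")  # ratio undefined when a residue is absent
    return observed / expected


# Toy sequence for illustration only; a ratio below 1 indicates
# under-representation and a ratio above 1 over-representation.
print(dipeptide_obs_exp_ratio("MFSAFSKKFSLLFSQQFSPPFS"))
```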

Finally, in addition to the 12 coronavirus species mentioned above, we inferred the RBD
usage of all other coronavirus species with an S1 protein sequence available in the
NCBI protein database (Table S3), based on the number of “FS” in the S1 protein. A total
of 31 coronavirus species covering all four major genera were used in the prediction. In
the genus Alphacoronavirus, all species except Mink coronavirus 1 were predicted to
use the CTD, whereas in the other genera, most
coronavirus species were predicted to use the NTD.

Overall, this work provides a simple and effective method for inferring the RBD
usage of a coronavirus from the protein sequence of the spike protein. It may
not only help in understanding the mechanisms behind RBD usage in coronaviruses,
but also aid the identification of host receptors for the virus.

Figure Legends

Figure 1 Predicting the RBD usage of the coronavirus based on the number of “FS”
in the protein sequence of the spike protein S1 subunit. (A) The distribution of the
number of “FS” in S1 and the predictive accuracy based on the number of “FS” in 12
coronavirus species. The coronavirus species using the NTD and the CTD are colored in
blue and red, respectively. The genus each virus belongs to is indicated at the top
right of the virus name. (B) and (C) show the 3D structure of the S1 subunit of MHV
and HCoV-NL63, respectively. The receptor-binding interface was inferred manually
from the spike-receptor complex (PDB id: 3r4d for MHV and 3kbh for HCoV-NL63).
The NTD and CTD are colored in cyan and yellow, respectively. The “FS”s are colored
in blue.




ACCEPTED MANUSCRIPT 


Figure 1
(A) Predictive accuracy by virus: BCoV 1, MHV 1, IBV 0.94, HCoV-OC43 1,
HCoV-229E 0.87, FCoV 0.99, BatCoV-HKU4 1, HCoV-HKU1 1, HCoV-NL63 1,
MERS-CoV 0.99, SARS-CoV 0.92, TGEV 1.